Executive summary

This project deals with the subject of consumption of food and macronutrients in developed countries. The main goal of this work is to improve our understanding of the relationship between nutrition, health and well-being by analyzing food consumption patterns in 5 different sampled countries (USA, Italy, France, Germany and Switzerland). By examining three databases in depth, using associated variables, we intend to detect possible links between high levels of obesity in the population and the consumption of certain macronutrients. Additionally, we aim to develop an in-depth understanding of the most relevant ingredients in products for different countries and establish some key questions about the consumption of organic and non-organic food around the world. Finally, we present a regression model to explain the relationship between two economic indexes, namely HDI and Gross Income per Capita, and obesity rates over a wider range of countries

Our results indeed show the existence of differences in dietary patterns in the countries under consideration, mostly that the US have a higher consumption of sugars. In addition, we present evidence that may conduct to link the consumption of organic products with indicators of health and wealth for the examined countries. Our linear regression results show a significant positive relationship between HDI and obesity.

Introduction

1.1 Overview and motivation

During the past decades, there have been big changes regarding the consumption of food and macronutrients in several regions of the world (The Global Epidemic of Obesity: An Overview, 2007). Whilst during the beginning of the 1900s one of the biggest challenges in public health was underweight and malnutrition. Since the 1960s the prevalence of obesity and overweight has been spotted to be dramatically increasing (US Department of Health, 1966). Still, the issues regarding this matter show very different situations in European countries and North America. In the case of European countries, the mean values of Body Mass Index (BMI) show that (as a mean among countries) 40% of the population has a BMI above 25 and 11.5% above 30 (Gallus, S., et.al., 2015). It is important to acknowledge that if 25<BMI<30, the person is considered to be overweight, and a BMI>30, shows obesity. On the other hand, if we check the same values regarding the US population, we obtain that a 65% has a BMI above 25 and 30% above 30 (Hedley AA, Ogden CL, Johnson CL. 2004). During the past 40 years, Americans have been adhering to federal dietary guidelines that have promoted a reduction in the consumption of fat (Cohen, E., et al., 2015). In the following project, we would like to understand if the quality of food consumption in European countries fundamentally differs from the consumption of US Americans in modern times.

In terms of our personal motivations for choosing this theme, we are a group formed of people from different nationalities and backgrounds. Hence, we wanted to work on a project that could address an international scale topic. We understand that nutrition and eating habits are much more than only the food you consume. It is a source of strong differences among countries and it may have much more to say about cultural and economic struggles.

1.2 Research questions

  1. Is there a nutrition pattern in countries that have a strong prevalence of obesity?

  2. Do countries that consume more organic products present a consistent improvement of health indicators over time?

  3. Is it always true that highest economic indexes lead to better health conditions?

Data

2.1 Data source

2.1.1 Food Repository Dataset

We collected the data for our Food Repository Dataset on The Open Food Repo website (Digital Epidemiology Laboratory, 2021). The steps to get access to a usable dataframe are the following: (1) Create an API Key through the website using the API key and response URL to call the GET function and access the data; (2) Convert the raw bytes into JSON format data Converting the JSON format into a list and dataframe; (3) Loop through the API call to get access to all the pages of the data (1891 in total). As all the products didn’t have all the same variables, we filled the missing data in each column with the rbind.fill function; (4) The next steps will consist of cleaning the data and selecting the most useful variables for the overall analysis.

#This is how we extracted data through API 
result <- GET("https://www.foodrepo.org/api/v3/products?excludes=images&page%5Bsize%5D=200&page%5Bnumber%5D=1.links.next", add_headers(authorization ="Token token=70d5edb1123d8489a05e0d2a3fde1f8b"))
FoodRepoList <- rawToChar(result$content)
FoodRepoData <- fromJSON(FoodRepoList,flatten=TRUE)
FoodDF <- FoodRepoData$data

for (i in 2:1891) {
  result <- GET(paste("https://www.foodrepo.org/api/v3/products?excludes=images&page%5Bsize%5D=200&page%5Bnumber%5D=",i,".links.next"), add_headers(authorization ="Token token=70d5edb1123d8489a05e0d2a3fde1f8b"))
  FoodRepoList <- rawToChar(result$content)
  Data <- fromJSON(FoodRepoList,flatten=TRUE)
  DF <- Data$data
  FoodDF <- rbind.fill(FoodDF,DF)
}

FoodDF_new <- FoodDF %>% select(-ends_with("name_translations.fr"),-ends_with("name_translations.de"),-ends_with("name_translations.it"),-ends_with("translations.fr"),-ends_with("translations.de"),-ends_with("translations.it"))

unique(FoodDF_new$country)
save(FoodDF_new, file = "FoodRepo.Rda")  
Data description

This dataset consists of a list of products that have their origin on the Open Food Repo website. The Open Food Repo project encourages consumers from all over the world to use the application to upload the nutritional information of the barcoded products that they consume on a day-to-day basis. This organization aims to be a digital public resource for information about the food products that are available to individual consumers in different markets (Digital Epidemiology Laboratory, 2021).

By getting access to this information, the country from which the product comes from may be tracked. Along with this, different numeric and qualitative information regarding the macronutrients, ions, vitamins, and ingredients may be also addressed. In Table 1.1 we present a summary of the most relevant variables that are found in the database.

Table 1.1 Description of variables in the FoodRepo database *this variable contains a numeric main variable and _unit variable that shows the unit in which the number is expressed
Name Type Explanation Name Type Explanation
1 Alcohol_by_volume Numeric %Vol of alcohol (degree) 24 Phosphorus* Numeric Content per hundred units
2 Barcode Numeric Identifier in the market 25 Polyunsaturated_fatty_acids* Numeric Content per hundred units
3 Calcium* Numeric Content per hundred units 26 Portion_quantity Numeric Conten per portion
4 Carbohydrates* Numeric Content per hundred units 27 Portion_unit String Unit specifying the portion size
5 Copper_cu* Numeric Content per hundred units* 28 Potasium_k* Numeric Content per hundred units
6 Country String Origin of product 29 Product_name_en String Name of the product in english
7 created_at Date Date of creation 30 Protein* Numeric Content per hundred units
8 Energy_kcal* Numeric Caloric intake 31 Quantity Numeric Amount per package
9 Energy* Numeric Caloric intake 32 Salt* Numeric Content per hundred units
10 Fat* Numeric Content per hundred units 33 Saturated_fat* Numeric Content per hundred units
11 Fatty_acids_total_saturated* Numeric Content per hundred units 34 Sodium* Numeric Content per hundred units
12 Fatty_acids_total_trans* Numeric Content per hundred units 35 Sugars_added* Numeric Content per hundred units
13 Fiber_insoluble* Numeric Content per hundred units 36 Sugars* Numeric Content per hundred units
14 Fiber_soluble* Numeric Content per hundred units 37 Unit String Unit of amount per package (31)
15 Folate_total* Numeric Content per hundred units 38 Vitamin_a* Numeric Content per hundred units
16 Folic* Numeric Content per hundred units 39 Vitami_b12* Numeric Content per hundred units
17 Hundred_unit String Unit of the total “per hundred unit” per product 40 Vitamin_b1* Numeric Content per hundred units
18 id Numeric Identifier in the website 41 Vitamin_b3* Numeric Content per hundred units
19 ingredients_en String Description of ingredients and product nature 42 Vitamin_b5* Numeric Content per hundred units
20 iron* Numeric Content per hundred units 43 Vitamin_c* Numeric Content per hundred units
21 Magnesium* Numeric Content per hundred units 44 Vitamin_d_d2_d3* Numeric Content per hundred units
22 Manganese_mn* Numeric Content per hundred units 45 Zinc* Numeric Content per hundred units
23 Monounsaturated_fatty_acids* Numeric Content per hundred units
load(file='FoodRepo.Rda')

To thoroughly understand the information presented in Table 1.1. It is important to know that the macro nutrients, ions, and vitamins may be presented in different units. For instance, whilst some values are presented in milligrams [mg] (mass unit), others may be presented in milliliters [mL] (volume unit). Knowing this difference is relevant for making the analysis, avoid mixing up the units.

Additionally, it is interesting to point out that the variable ingredients_en, does not only contain ingredients, it also contains a small description of the product. If we analyze this section of the data, we can possibly get further information about the quality of the products in each market. Finally, the created_at variable may be interesting to analyze, for tracking seasonality in the consumption of certain products or nutrients.

2.1.2 World Obesity Dataset

To observe and compare obesity rates among the countries under consideration, we are consulting the “Prevalence of obesity among adults” data set provided by the World Health Organization (World Health Organization, 2022).

From this data set, we obtain obesity data from Switzerland, France, Germany, Italy, and the United States for an available time interval between 1983-2016. The exported data is manually reformatted into a proper form in Excel for further manipulation with R.

Data description

We use this data set to study the relationship between obesity values and the consumption of specific macro nutrients. Moreover, the available data makes it possible for us to show the dynamics of changes in obesity levels in the countries of focus over the last few decades, both for the total adult population and by sex. We select a subset with the country-specific data of our concern from the “Prevalence of obesity among adults” data set. Table 1.2. provides a description of the variables that may be found in the document.

The output describes the percentage of adults who are obese for an indicated year. As it was previously stated, people are considered obese if their Body Mass Index (BMI) is greater than or equal to 30.

Table 1.2 Description of variables in the World Obesity database
Name Type Explanation
1 Country String Country code of observation
2 Year Numeric Year of observation
3 Sex String Group of observation by sex
4 Obesity_rate Numeric % of the target population with BMI ≥ 30
#Extracting worldwide data 
ob_data_world <- read.csv("Obesity_data_world.csv")
names(ob_data_world)<- tolower(names(ob_data_world))

ob_data_world <- ob_data_world[,c("spatialdimvaluecode","location","period","dim1","factvaluenumeric")]

colnames(ob_data_world)[1]<- "code"
colnames(ob_data_world)[2]<- "country"
colnames(ob_data_world)[3]<- "year"
colnames(ob_data_world)[4]<- "sex"
colnames(ob_data_world)[5]<- "obesity_rate"

ob_data_world$obesity_rate <- ob_data_world$obesity_rate/100
OB_W <- filter(ob_data_world,sex=="Both sexes")
#OB_W$country <- gsub(" of America", "", OB_W$country)
#OB_W$country <- gsub(" of Great Britain and Northern Ireland", "", OB_W$country)

OB_W$year<-as.character(OB_W$year)

2.1.3 HDI Dataset

Finally, we use the compiled data set of the Human Development Indices (HDI) provided by the United Nations Development Program (United Nations, 2021). From this set we are able to extract information for our selected countries over a time period of 30 years (1990 to 2021).

As the data was provided as a .csv file, we were able to easily transform it into a dataframe for easy manipulation.

Data description

To understand the relationship between healthy nutrition and economic development, we used a supporting data set with information on the Human Development Index (HDI) for the countries in focus.

The HDI is a summary measure for assessing long-term progress in three basic dimensions: economic wealth, life expectancy and education.

For our research purposes, from the available data set on the United Nations website (United Nations, 2021), we will work with the variables listed in the Table 1.3.

Table 1.3 Description of variables in the HDI database
Name Type Explanation
1 Country String Country code of observation
2 Year Numeric Year of observation
3 HDI Numeric HDI value
4 Class String Human development classification
hdr_data <- read.csv("HDR.csv")

2.2 Data cleanup

2.2.1 Column drop

For starting this process let us take a look at our information. We want to better understand the quality of our columns and the information that we may actually work with.

In the following lines we will count the number of NA rows in each column. Afterwards, we will create a dataframe that contains this information for making visualization easier in the following step. Finally, using the mutate function we create an extra column that contains the total percentage of missing information for all of the 304 variables (columns).

na_count <-sapply(FoodDF_new, function(y) sum(length(which(is.na(y)))))
na_count <- data.frame(na_count)
na_count<- na_count %>% mutate(percentage = na_count / nrow(FoodDF_new)*100)

Now, we may work with our recently created dataframe. We want to have a big picture regarding the number of columns that contain a big or a small number of missing observations (rows). Let us take a look at the following histogram, with it we intend to see how many columns count with a 0%-10% of missing information, 10%-20% of missing information, and so over until we meet 90%-100% of missing information :

# Add all the labels and color. Set the y axis indexes horizontal. Set limits for axis and
# Drawing lables on top of bars
hist(na_count$percentage,main="Missing information", xlab = "Percentage of missing information [%]", ylab = "Number of columns in database",border="#1F78B4", col= '#A6CEE3',las=1,xlim=c(0,100),ylim=c(0,240),labels=TRUE)

We may see that the data frame has still a big number of columns that present a high level of missing information. In general terms, more than 243 columns have more than a 70% of missing information. We may also see that 0 columns miss between a 50% and a 70%, and that only 61 columns have more than a 50% of their information complete. We decide to drop all of the columns that present more than a 40% of missing information. With this, we should now have 54 columns in the database.

FoodDF_new = FoodDF_new[,!sapply(FoodDF_new, function(x) mean(is.na(x)))>0.4]
#ncol(FoodDF_new)

If we take a look at the 54 columns in the cleaned dataframe, we will see that there are some of them that identify the unit in which the specific nutrient is measured. For instance, we may find that whilst the calcium content is measured in [“mg”], the carbohydrates are measured in [“g”]. These columns may be identified because they contain the word “unit” on it. An important characteristic of the “unit” columns is that for a same column, the string doesn’t vary from row to row. Therefore, we will propose to make a new dataframe that contains this information in a simplified format. With this, we aim to drop the unit columns from the cleaned dataframe, while maintaining the unit information for appropriately managing the units in the future.

#install.packages("kableExtra")
library(kableExtra)
library(knitr)
num_data <- select_if(FoodDF_new,is.numeric)
unit <- FoodDF_new %>% select(ends_with("_unit")| ends_with("_volume")| ends_with((".unit")))   
unit2 <- as.data.frame(t(unit)) 
unit2 <- unit2 %>% mutate(V1 = coalesce(V1,V21)) %>% select(V1)

Min_data <- c()
Max_data <- c()
Name_data <- c()

for (i in 1:length(num_data)) {
  x <- min(num_data[i],na.rm = TRUE)
  Min_data <- c(Min_data,x)
}

for (i in 1:length(num_data)) {
  y <- max(num_data[i], na.rm = TRUE)
  Max_data <- c(Max_data,y)
}

Sum_data <- mutate(unit2, Min = Min_data[-1], Max = Max_data[-1], Name = colnames(num_data[,-1]))%>% as.data.frame()   
  colnames(Sum_data)[1] <- "Unit"
  Sum_data ["nutrients.cholesterol.unit",1] <- "mg"
  Sum_data ["nutrients.fatty_acids_total_saturated.unit",1] <- "g"
  Sum_data ["nutrients.fatty_acids_total_trans.unit",1] <- "g"
  Nutrient <- rownames(Sum_data)
Sum_data <- cbind(Nutrient,Sum_data)


Sum_data <- as.data.frame(Sum_data)


library("DT")
datatable(Sum_data, options = list(), class = "display",
    callback = JS("return table;"), colnames,
    caption = NULL, filter = c("none", "bottom", "top"), escape = TRUE,
    style = "auto", width = NULL, height = NULL, elementId = NULL,
    fillContainer = getOption("DT.fillContainer", NULL),
    autoHideNavigation = getOption("DT.autoHideNavigation", NULL),
    selection = c("multiple", "single", "none"), extensions = list(),
    plugins = NULL, editable=FALSE)
# Drop columns that contain the word unit
df <- FoodDF_new %>% select(-contains("unit"))
#ncol(df)
# Drop columns that contain less than 2 unique values (NA+unique value)
df<-Filter(function(x)(length(unique(x))>2), df)
#ncol(df)
#Simplifying the names of the columns in the dataframe
for ( col in 1:ncol(df)){
    colnames(df)[col] <-  sub(".per_hundred", "", colnames(df)[col])
    colnames(df)[col] <-  sub("nutrients.", "", colnames(df)[col])
}

Along with eliminating the unit columns, we have eliminated all of those columns that had less than 2 unique values among their rows. We eliminate these columns, because they don´t offer variability, hence they are useless for future comparisons among variables. In the filter, we use 2 unique values instead of 1, because NA is considered an additional unique value in the count.

The new results are summarized in the following plots:

#ncol(df)
vis_dat(df,warn_large_data = FALSE) + scale_fill_brewer(palette="Paired")

#(values = c("character" ="#3D5941FF","integer" = "#B5B991FF","numeric"= "#EDBB8AFF","NA" = "gray"))
missing.values <- df %>%
    gather(key = "key", value = "val") %>%
    mutate(is.missing = is.na(val)) %>%
    group_by(key, is.missing) %>%
    summarise(num.missing = n()) %>%
    filter(is.missing==T) %>%
    select(-is.missing) %>%
    arrange(desc(num.missing)) 
missing.values <- df %>%
  gather(key = "key", value = "val") %>%
  mutate(isna = is.na(val)) %>%
  group_by(key) %>%
  mutate(total = n()) %>%
  group_by(key, total, isna) %>%
  summarise(num.isna = n()) %>%
  mutate(pct = num.isna / total * 100)

levels <-
    (missing.values  %>% filter(isna == T) %>% arrange(desc(pct)))$key

percentage.plot <- missing.values %>%
      ggplot() +
        geom_bar(aes(x = reorder(key, desc(pct)), 
                     y = pct, fill=isna), 
                 stat = 'identity', alpha=0.8) +
      scale_x_discrete(limits = levels) +
      scale_fill_brewer(palette="Paired",name="",labels = c("Present", "Missing"))+
      coord_flip() +
      labs(title = "Percentage of missing values", x =
             'Variable', y = "% of missing values")

percentage.plot

miss_plot <- plot_missing(df) 

We may see that none of the columns is now missing an excessive amount of information. Still, there is a relevant question to answer regarding the possibility of dropping rows. Dropping rows in the following chapter will be covered as a way of identifying and dropping outliers and possible sources of error.

2.2.2 Row Drop: Identification of Outliers

Regular values in nutrients: Number of nutrients per hundred

If we take a further look at the way in which labels of food are constructed around the world, we may see that the content of each nutrient is presented on a base of 100 [g] of product. This means that if you consume 100 [g] of cereal, you will find certain grams of carbohydrates, fats, sugars or salt on it. As the base is 100 [g], the sum of all of the nutrients in [g], should never surpass this mass, because it would mean that one of the values has been incorrectly computed (probably by the user).

In this section we will take a look at the information in the rows. Particularly we will create a new column called mass_nutrients. This column contains the sum of proteins, fat, carbohydrates, salt and fiber. If we take a look at the per_hundred table that was presented in section 2.2.1, we may see that all of these nutrients are presented in grams. Hence, we will not have to do any unit conversions. With this, we will make sure the information in the database correctly contains 100 [g] of nutrients. Any product that surpasses the base will be dropped from the observations.

# This adds the new column to the data frame.
df$mass_nutrients = rowSums(df[,c("protein", "carbohydrates", "sodium", "fiber", "fat","sugars")], na.rm = TRUE)
# layout boxplot is at the bottom 
library()
nf <- layout(mat = matrix(c(1,2),2,1, byrow=TRUE),  height = c(3,1))
par(mar=c(3.1, 3.1, 1.1, 2.1))

hist(df$mass_nutrients,xlim=c(0,250), ylim=c(0,0.02), breaks=2000, freq=FALSE, border="#1F78B4", col= '#A6CEE3',main="Distribution of mass nutrients")

boxplot(df$mass_nutrients, horizontal=TRUE,  outline=TRUE,ylim=c(0,250), frame=F, width = 10, border="#1F78B4", col= '#A6CEE3')

We may see that the information is mostly concentrated in values under 100, just as we had previously stated. The next step consists on dropping the rows which mass nutrients surpass the base value of 100. [g].

df1<-subset(df, mass_nutrients<=100)

perc<-((nrow(df)-nrow(df1))/nrow(df)*100)

print(paste0("We have dropped a ", round(perc, digits = 2), "% of the rows"))
## [1] "We have dropped a 29.91% of the rows"

Let’s take a look at the new histogram

# layout boxplot is at the bottom 
nf <- layout(mat = matrix(c(1,2),2,1, byrow=TRUE),  height = c(3,1))
par(mar=c(3.1, 3.1, 1.1, 2.1))

hist(df1$mass_nutrients,xlim=c(0,100), ylim=c(0,0.04), breaks=20, freq=FALSE, border="#1F78B4", col= '#A6CEE3',main="Distribution of mass nutrients")

boxplot(df1$mass_nutrients, horizontal=TRUE,  outline=TRUE,ylim=c(0,100), frame=F, width = 10, border="#1F78B4", col= '#A6CEE3')

Minimum and maximum values of some nutrients according to baseline

In the following lines we intend to finish the cleaning process of the information. To better understand the logic that we are following it is necessary to know that we consulted several sources of information for creating a baseline. With the term baseline, we intend to say that we looked for usual values between which carbohydrates, fat and proteins may fluctuate in a usual diet. The margins that we will work with are the following:

  • Carbohydrates: 45%-65% (Dietary Reference Intakes(DRIs) - Food and Nutrition Board, Institute of Medicine)

  • Fat: 20% - 35% (Dietary Reference Intakes(DRIs) - Food and Nutrition Board, Institute of Medicine)

  • Protein: 10% - 35% (Dietary Reference Intakes(DRIs) - Food and Nutrition Board, Institute of Medicine)

Let us now visualize these 3 macro-nutrients and their respective value intervals

#install.packages("ggpubr")

carbo<-replicate(length(df1$carbohydrates), "carbohydrates")
fats<-replicate(length(df1$fat), "fat")
proteins<-replicate(length(df1$protein), "protein")
macronutrients<- c(carbo,fats,proteins)
v2<-c(df1$carbohydrates,df1$fat,df1$protein)
histogram<-data.frame(macronutrients,v2)
library(RColorBrewer)
ggviolin(histogram, "macronutrients", "v2", fill = "macronutrients",
   add = "mean", add.params = list(fill = "white"), xlab="Macronutrients", ylab="Values in [g]", main="Distribution of macronutrients")+ scale_fill_brewer(palette = "Paired")

From these plots we may see that most macro-nutrients are concentrated in the base for protein and fat, while there is a higher dispersion and shifted distribution for carbohydrates. We will maintain the visualization in a non logarithmic scale, as what is really important about his plot is to realize that the individual values of nutrients are all low in comparison to the total 100 [g], and that carbohydrates are slightly higher than fats and proteins. This is consistent with the higher values of carbohydrates expected in regular meals as a baseline. As the baseline is only a reference, we will decide not to drop any values. The reason behind this decision is the fact that we have a wide range of food. Therefore, we expect to find some products that can be mainly of kind fat, protein or carbohydrates without meaning that the user has made a mistake. In this plot we consistently see that the data is not accumulated on top of the violin plot, which is enough to think that most of our information is correctly addressed.

2.2.3 Correlation Matrix

Finally, we will take a look at the correlation matrix. Do all of the columns provide new information? In this case we are not developing any models. Hence, high correlation should not be an indicator for dropping information. Still, we would like to know if there are specific correlations that could help us better understand the behavior of the data set.

numerical<- df1%>%select_if(is.numeric)%>% select(-c("portion_quantity","alcohol_by_volume"))

numeric<-cor(numerical, use = "complete.obs") 
corrplot(numeric, method = "square", outline = T, addgrid.col = "darkgray", order="hclust", mar = c(0,0,0,0), addrect = 1, rect.col = "black", rect.lwd = 3,cl.pos = "b", tl.col = "indianred4", tl.cex = 0.4, cl.cex = 0.5, number.cex=0.4)

From this section we may see that the variables “portion_quantity” and “alcohol_by_volume”, could not be used due to the high presence of NA variables, which affected the correlation matrix. However, we managed to calculate the correlation matrix for the rest of the numerical variables, from which the main observations are the following:

  1. The fat variable is highly correlated to the total energy variables. Which means that a great source of energetic intake comes from this specific macronutrient. This makes sense as the biggest caloric intake comes from fat with 9 [Kcal/g].
  2. The second most correlated variable with energy is carbohydrates, which usually proportions a caloric intake of 4 [Kcal/g].
  3. As expected, fat (total) and fatty acids total saturated have a high correlation, as one variable is contained in the other one.
  4. Finally, the two energetic measures (energy and energy_kcal) show a high correlation, as you may calculate one of them by multiplying the other one by a linear term.

In-Depth Analysis

In this chapter we intend to enhance an in-depth understanding of our information. It is relevant to remember that we are working in the margin of our investigation questions. Therefore, each subsection will be aimed to respond to the three main questions.

3.1 Is there a nutrition pattern in countries that have a strong prevalence of obesity?

3.1.1 Data distribution

In the following chapter we will be working with the cleaned data for responding to the first question: “Is there a nutrition pattern in countries that have a strong prevalence of obesity?”. We start by taking a look at the distribution of the quantitative data. Are the nutrients normally distributed or skewed? We selected only the numeric variables from our data set and plotted them all at once.

num_data <- select_if(df1,is.numeric)
plot_data <- num_data[,-1] %>% pivot_longer(colnames(num_data[,-1])) %>% as.data.frame()

ggplot(plot_data, aes(value)) + geom_histogram(bins=30,fill ="#1F78B4") + facet_wrap(~name, scales = "free", ncol=4) + scale_y_log10() + theme(strip.background = element_rect(colour="black",fill="#A6CEE3"))

macro_data <- num_data[c(7:11)]
macro_plot <- macro_data %>% pivot_longer(colnames(macro_data)) %>% as.data.frame()

ggplot(macro_plot,aes(value)) + geom_boxplot(fill="#1F78B4") + facet_wrap(~name,ncol=5)+ coord_flip() + theme(strip.background = element_rect(colour="black",fill="#A6CEE3"))

We clearly see that the distributions are skewed for all of the plotted nutrients. This discovery is relevant for the analysis, since the preferred measure of central tendency often depends on the shape of the distribution. In this case, as all of the information is skewed the median is the correct measure of central dispersion for representing the nutrients.

3.1.2 Comparison with baseline

In the following lines we present the mean and the median of the five main macro-nutrients (Carbohydrates, Fiber, Protein, Sugar and Fat) for each country. After grouping the data per country, we compute the proportion of the five macro nutrients in order to visualize them against the baseline we created before, considering also information for sugar and fiber this time.

library(plotly)

Mean_per_country <- df1 %>% group_by(country) %>% dplyr::summarise_if(is.numeric, mean, na.rm=TRUE)
Median_per_country<- df1 %>% group_by(country) %>% dplyr::summarise_if(is.numeric, median, na.rm=TRUE)
names <- c("Country","Fat","Carbohydrates","Protein","Sugars","Fiber")

Mean_macro <- Mean_per_country[c(1,8:12)]
  Mean_macro[2:6]<-Mean_macro[2:6]/rowSums(Mean_macro[2:6])
  Mean_macro <- rbind(Mean_macro,c("Baseline",0.20,0.5,0.18,0.10,0.02))
  colnames(Mean_macro)<- names
  Mean_macro$Fat <- as.numeric(Mean_macro$Fat)
  Mean_macro$Carbohydrates <- as.numeric(Mean_macro$Carbohydrates)
  Mean_macro$Protein<- as.numeric(Mean_macro$Protein)
  Mean_macro$Sugars <- as.numeric(Mean_macro$Sugars)
  Mean_macro$Fiber<- as.numeric(Mean_macro$Fiber)

Mean_graph <- plot_ly() %>% add_trace(data=Mean_macro,x=~Fat,y=~Country,type="bar",orientation="h",marker=list(color="#A6CEE3"),name="Fat") %>%
  add_trace(x=~Carbohydrates,y=~Country,type="bar",marker=list(color="#1F78B4"),name="Carbohydrates")%>%
  add_trace(x=~Protein,y=~Country,type="bar",marker=list(color="#B2DF8A"),name="Protein") %>%
  add_trace(x=~Sugars,y=~Country,type="bar",marker=list(color="#33A02C"),name="Sugars") %>%
  add_trace(x=~Fiber,y=~Country,type="bar",marker=list(color="#FB9A99"),name="Fiber") %>%
  layout(title = "Mean consumption of macronutrients per country",xaxis = list(title="Proportion of Macronutrients"),barmode='stack')
  
Median_macro<- Median_per_country[c(1,8:12)]
  Median_macro[2:6]<-Median_macro[2:6]/rowSums(Median_macro[2:6])
  colnames(Median_macro)<- names
  Median_macro <- rbind(Median_macro,c("Baseline",0.20,0.5,0.18,0.10,0.02))
  Median_macro$Fat <- as.numeric(Median_macro$Fat)
  Median_macro$Carbohydrates <- as.numeric(Median_macro$Carbohydrates)
  Median_macro$Protein<- as.numeric(Median_macro$Protein)
  Median_macro$Sugars <- as.numeric(Median_macro$Sugars)
  Median_macro$Fiber<- as.numeric(Median_macro$Fiber)
  
Median_graph <- plot_ly() %>% add_trace(data=Median_macro,x=~Fat,y=~Country,type="bar",orientation="h",marker=list(color="#A6CEE3"),name="Fat") %>%
    add_trace(x=~Carbohydrates,y=~Country,type="bar",marker=list(color="#1F78B4"),name="Carbohydrates")%>%
    add_trace(x=~Protein,y=~Country,type="bar",marker=list(color="#B2DF8A"),name="Protein") %>%
    add_trace(x=~Sugars,y=~Country,type="bar",marker=list(color="#33A02C"),name="Sugars") %>%
    add_trace(x=~Fiber,y=~Country,type="bar",marker=list(color="#FB9A99"),name="Fiber") %>%
    layout(title = "Median consumption of macronutrients per country",xaxis = list(title="Proportion of Macronutrients"),barmode='stack')

Mean_graph
Median_graph

3.1.3 Confidence intervals

By making a quick inspection of the results, we may see small differences between countries, yet all of them are relatively proportionate to the baseline. Let us take a further look at the behavior of the macro nutrients by computing the confidence intervals and their respective baseline value.

fr <- df1%>%filter(country=="FR")
ch <- df1%>%filter(country=="CH")
us<- df1%>%filter(country=="US")
it<- df1%>%filter(country=="IT")
de<- df1%>%filter(country=="DE")

M_fr <- lm(fat ~ 1, fr)
M_ch<- lm(fat ~ 1, ch)
M_us <- lm(fat ~ 1, us)
M_it<- lm(fat ~ 1, it)
M_de<- lm(fat ~ 1, de)

c0_fr <- coef(M_fr)
c0_ch <- coef(M_ch)
c0_us <- coef(M_us)
c0_it <- coef(M_it)
c0_de <- coef(M_de)

cc_fr <- confint(M_fr, level = 0.95)
cc_ch <-confint(M_ch, level= 0.95)
cc_us <- confint(M_us, level = 0.95)
cc_it <-confint(M_it, level= 0.95)
cc_de <-confint(M_de, level= 0.95)

x <- c(1,2,3,4,5)
xLabels=c("France", "Switzerland","USA","Italy", "Germany")
F <- c(mean(fr$fat,na.rm=TRUE),mean(ch$fat, na.rm=TRUE),mean(us$fat, na.rm=TRUE),mean(it$fat, na.rm=TRUE),mean(de$fat, na.rm=TRUE))
L <-c(cc_fr[1],cc_ch[1],cc_us[1],cc_it[1],cc_de[1])
U <-c(cc_fr[2],cc_ch[2],cc_us[2],cc_it[2],cc_de[2])

require(plotrix)
par(mfrow=c(3,2))
plotCI(x, F, ui=U, li=L, xlab="Fat", ylab="Values [%]", xlim=c(0,6), title="Interval", xaxt="n", main="Confidence intervals and mean for Fat")
abline(h=10,col="#1F78B4",text(3, 10.45, "Baseline",col="#1F78B4"))
axis(side=1, at=(1:5), labels=xLabels)

M_fr <- lm(carbohydrates ~ 1, fr)
M_ch<- lm(carbohydrates ~ 1, ch)
M_us <- lm(carbohydrates ~ 1, us)
M_it<- lm(carbohydrates ~ 1, it)
M_de<- lm(carbohydrates ~ 1, de)

c0_fr <- coef(M_fr)
c0_ch <- coef(M_ch)
c0_us <- coef(M_us)
c0_it <- coef(M_it)
c0_de <- coef(M_de)

cc_fr <- confint(M_fr, level = 0.95)
cc_ch <-confint(M_ch, level= 0.95)
cc_us <- confint(M_us, level = 0.95)
cc_it <-confint(M_it, level= 0.95)
cc_de <-confint(M_de, level= 0.95)

F <- c(mean(fr$carbohydrates,na.rm=TRUE),mean(ch$carbohydrates, na.rm=TRUE),mean(us$carbohydrates, na.rm=TRUE),mean(it$carbohydrates, na.rm=TRUE),mean(de$carbohydrates, na.rm=TRUE))
L <-c(cc_fr[1],cc_ch[1],cc_us[1],cc_it[1],cc_de[1])
U <-c(cc_fr[2],cc_ch[2],cc_us[2],cc_it[2],cc_de[2])

require(plotrix)

plotCI(x, F, ui=U, li=L, xlab="Carbohydrates", ylab="Values [%]", xlim=c(0,6), title="Interval", xaxt="n", main="Confidence intervals and mean for Carbohydrates")
abline(h=20,col="#1F78B4",text(3, 20.8, "Baseline",col="#1F78B4"))
axis(side=1, at=(1:5), labels=xLabels)

M_fr <- lm(protein ~ 1, fr)
M_ch<- lm(protein ~ 1, ch)
M_us <- lm(protein ~ 1, us)
M_it<- lm(protein ~ 1, it)
M_de<- lm(protein ~ 1, de)

c0_fr <- coef(M_fr)
c0_ch <- coef(M_ch)
c0_us <- coef(M_us)
c0_it <- coef(M_it)
c0_de <- coef(M_de)

cc_fr <- confint(M_fr, level = 0.95)
cc_ch <-confint(M_ch, level= 0.95)
cc_us <- confint(M_us, level = 0.95)
cc_it <-confint(M_it, level= 0.95)
cc_de <-confint(M_de, level= 0.95)

F <- c(mean(fr$protein,na.rm=TRUE),mean(ch$protein, na.rm=TRUE),mean(us$protein, na.rm=TRUE),mean(it$protein, na.rm=TRUE),mean(de$protein, na.rm=TRUE))
L <-c(cc_fr[1],cc_ch[1],cc_us[1],cc_it[1],cc_de[1])
U <-c(cc_fr[2],cc_ch[2],cc_us[2],cc_it[2],cc_de[2])

require(plotrix)

plotCI(x, F, ui=U, li=L, xlab="Protein", ylab="Values [%]", xlim=c(0,6), title="Interval", xaxt="n", main="Confidence intervals and mean for Proteins")
abline(h=10,col="#1F78B4",text(3, 10.2, "Baseline",col="#1F78B4"))
axis(side=1, at=(1:5), labels=xLabels)

M_fr <- lm(sugars ~ 1, fr)
M_ch<- lm(sugars ~ 1, ch)
M_us <- lm(sugars ~ 1, us)
M_it<- lm(sugars ~ 1, it)
M_de<- lm(sugars ~ 1, de)

c0_fr <- coef(M_fr)
c0_ch <- coef(M_ch)
c0_us <- coef(M_us)
c0_it <- coef(M_it)
c0_de <- coef(M_de)

cc_fr <- confint(M_fr, level = 0.95)
cc_ch <-confint(M_ch, level= 0.95)
cc_us <- confint(M_us, level = 0.95)
cc_it <-confint(M_it, level= 0.95)
cc_de <-confint(M_de, level= 0.95)

F <- c(mean(fr$sugars,na.rm=TRUE),mean(ch$sugars, na.rm=TRUE),mean(us$sugars, na.rm=TRUE),mean(it$sugars, na.rm=TRUE),mean(de$sugars, na.rm=TRUE))
L <-c(cc_fr[1],cc_ch[1],cc_us[1],cc_it[1],cc_de[1])
U <-c(cc_fr[2],cc_ch[2],cc_us[2],cc_it[2],cc_de[2])

require(plotrix)

plotCI(x, F, ui=U, li=L, xlab="Sugar", ylab="Values [%]", xlim=c(0,6), , ylim=c(0,10.5),title="Interval", xaxt="n", main="Confidence intervals and mean for Sugars")
abline(h=10,col="#1F78B4",text(3, 10.45, "Baseline",col="#1F78B4"))
axis(side=1, at=(1:5), labels=xLabels)

M_fr <- lm(fiber ~ 1, fr)
M_ch<- lm(fiber ~ 1, ch)
M_us <- lm(fiber ~ 1, us)
M_it<- lm(fiber ~ 1, it)
M_de<- lm(fiber ~ 1, de)

c0_fr <- coef(M_fr)
c0_ch <- coef(M_ch)
c0_us <- coef(M_us)
c0_it <- coef(M_it)
c0_de <- coef(M_de)

cc_fr <- confint(M_fr, level = 0.95)
cc_ch <-confint(M_ch, level= 0.95)
cc_us <- confint(M_us, level = 0.95)
cc_it <-confint(M_it, level= 0.95)
cc_de <-confint(M_de, level= 0.95)

F <- c(mean(fr$fiber,na.rm=TRUE),mean(ch$fiber, na.rm=TRUE),mean(us$fiber, na.rm=TRUE),mean(it$fiber, na.rm=TRUE),mean(de$fiber, na.rm=TRUE))
L <-c(cc_fr[1],cc_ch[1],cc_us[1],cc_it[1],cc_de[1])
U <-c(cc_fr[2],cc_ch[2],cc_us[2],cc_it[2],cc_de[2])

require(plotrix)

plotCI(x, F, ui=U, li=L, xlab="Fiber", ylab="Values [%]", xlim=c(0,6), title="Interval", xaxt="n", main="Confidence intervals and mean for Fiber")
abline(h=2,col="#1F78B4",text(3, 2.25, "Baseline",col="#1F78B4"))
axis(side=1, at=(1:5), labels=xLabels)

From the confidence intervals we may see that USA is the country that contains the biggest amount of information, followed by Switzerland and Germany. Some relevant results are:

  • The consumption of every macro nutrient is lower in USA, except for sugar, which is over the consumption values in comparison to all of the other countries.

  • Italy and Germany consume carbohydrates over the baseline, and in higher proportions than the rest of the countries.

  • Italy presents the highest consumption of fat, the value is considerably different from the consumption in the rest of the countries.

  • The consumption of proteins is similar among all of the European countries. USA presents a significantly lower consumption interval.

  • Germany has the highest consumption interval of fiber. USA presents a lower consumption than the European countries.

3.1.4 Most common ingredients per country

We will finish this question by taking a look at the most commonly found ingredients per country. To do this, we will work with the qualitative data presented in the ingredients.en column. This variable contains the list of ingredients in each labeled product in English.

ingredients_total <- select(df1,country,ingredients_translations.en) %>%
  na.omit(df1)
#Сreating subsets with ingredients for each country
{ingredients_ch <- filter(ingredients_total,country=='CH')
 ingredients_de <- filter(ingredients_total,country=='DE')
 ingredients_fr <- filter(ingredients_total,country=='FR')
 ingredients_it <- filter(ingredients_total,country=='IT')
 ingredients_us <- filter(ingredients_total,country=='US')}

#Creating one common ingredient list in lower register for each country
{switzerland_list <- toString(tolower(paste(ingredients_ch$ingredients_translations.en)))
 germany_list <- toString(tolower(paste(ingredients_de$ingredients_translations.en)))
 france_list <- toString(tolower(paste(ingredients_fr$ingredients_translations.en)))
 italy_list <- toString(tolower(paste(ingredients_it$ingredients_translations.en)))
 united_states_list <- toString(tolower(paste(ingredients_us$ingredients_translations.en)))}

Once the necessary information has been converted into strings, we can create a function to count the most frequently occurring words. First, this function removes all unnecessary information, for example, removes non-letter symbols, extra spaces, words that occur frequently but are not ingredients (“a”, “the” etc…). Then the function splits the obtained cleaned string by words and counts the number of occurrences.

# function to sort and count ingredients 
countwords = function(strings){
  
  # remove line breaks
  w1 <- gsub(pattern = '\n', replacement=" ", strings)
  w1 <- gsub('[(\")(\\)]', '', deparse(w1))
  
  # remove punctuations
  w2 <- gsub(pattern="[[:punct:]0-9]", replacement=" ", x=w1)
  
  #Deleting stopwords
  stopwords = c(' a ',' an ',' the ',' it ', ' with ',' without ',' is ',' are ',' of ',' from ',' on ',' in ', ' may ','contain',' all ',' and ',' fresh ',
                ' organic ',' natural ','cooking ','sunflower ','powder ','olive ','acid ','juice ','extract ','sodium ',' gum ',' flavor ',' citric ', 
                ' flour ',' acidity ', ' glucose ', ' dextrose ', ' rapeseed ', ' switzerland ')
  
  w3 <- gsub(paste0(stopwords,collapse = "|"),"", w2)
  
  # remove all single-letter words 
  w4 <- gsub('\\b\\w{1}\\b',"",w3)
  
  # remove unnecessary spaces
  w5 <- gsub(" {2,}"," ", w4) # remove extra spaces between words
  
  w5 <- str_trim(w5, "right") #remove all trailing whitespace

  w5 <- str_trim(w5, "left") #remove all leading whitespace
  
  # split the words
  w6 <- strsplit(w5, " ")
  
  # sort words in table
  w7 <- data.frame(sort(table(w6, exclude=""), decreasing=TRUE))
  
  # rename columns
  colnames(w7)[1] <- 'ingredient'
  colnames(w7)[2] <- 'frequency'
  w7
}

#runing the function for each country
{switzerland <- countwords(switzerland_list)
 germany <- countwords(germany_list)
 france <- countwords(france_list)
 italy <- countwords(italy_list)
 united_states <- countwords(united_states_list)}

Now let’s have a look at the top 10 most common ingredients for each country:

#tables with top 10 ingredients
{switzerland_top <- switzerland[1:10,]
 germany_top <- germany[1:10,]
 france_top <- france[1:10,]
 italy_top <- italy[1:10,]
 united_states_top <- united_states[1:10,]}


#1) Graph for Switzerland
top_ch <- ggplot(switzerland_top)+
  geom_bar(aes(x=ingredient, y=frequency,fill="#A6CEE3"), position="dodge", stat = 'identity')+
  scale_fill_manual( values = c("#A6CEE3"))+
  theme(legend.position = "none", axis.title.y = element_text(size=10), axis.title.x = element_text(size=10),
        axis.text.x = element_text(angle = 30,size = 9), panel.background = element_rect(fill = "white" ))+
  labs(title = 'Switzerland',
       x = 'Ingredients',
       y = 'Number of appearances')


#2) Graph for Germany
top_de <- ggplot(germany_top)+
  geom_bar(aes(x=ingredient, y=frequency,fill="#1F78B4"), position="dodge", stat = 'identity')+
  scale_fill_manual( values = c("#1F78B4"))+
  theme(legend.position = "none", axis.title.y = element_text(size=10), axis.title.x = element_text(size=10),
        axis.text.x = element_text(angle = 30,size = 9), panel.background = element_rect(fill = "white" ))+
  labs(title = 'Germany',
       x = 'Ingredients',
       y = 'Number of appearances')


#3) Graph for France
top_fr <- ggplot(france_top)+
  geom_bar(aes(x=ingredient, y=frequency,fill="#B2DF8A"), position="dodge", stat = 'identity')+
  scale_fill_manual( values = c("#B2DF8A"))+
  theme(legend.position = "none", axis.title.y = element_text(size=10), axis.title.x = element_text(size=10),
        axis.text.x = element_text(angle = 30,size = 9), panel.background = element_rect(fill = "white" ))+
  labs(title = 'France',
       x = 'Ingredients',
       y = 'Number of appearances')


#4) Graph for Italy
top_it <- ggplot(italy_top)+
  geom_bar(aes(x=ingredient, y=frequency,fill="#33A02C"), position="dodge", stat = 'identity')+
  scale_fill_manual( values = c("#33A02C"))+
  theme(legend.position = "none", axis.title.y = element_text(size=10), axis.title.x = element_text(size=10),
        axis.text.x = element_text(angle = 30,size = 9), panel.background = element_rect(fill = "white" ))+
  labs(title = 'Italy',
       x = 'Ingredients',
       y = 'Number of appearances')


#5) Graph for the United States
top_us <- ggplot(united_states_top)+
  geom_bar(aes(x=ingredient, y=frequency,fill="#FB9A99"), position="dodge", stat = 'identity')+
  scale_fill_manual( values = c("#FB9A99"))+
  theme(legend.position = "none", axis.title.y = element_text(size=10), axis.title.x = element_text(size=10),
        axis.text.x = element_text(angle = 30,size = 9), panel.background = element_rect(fill = "white" ))+
  labs(title = 'United States',
       x = 'Ingredients',
       y = 'Number of appearances')


#Сommon graph
figure2 <- ggarrange(top_ch, top_de, top_fr, top_it, top_us,
                     ncol = 3, nrow = 2)
annotate_figure(figure2,
                top = text_grob("10 most common ingredients by country", face = "bold", size = 14))

As we can see, salt is the most common ingredient in products for all countries. Next, among the top ingredients in all countries we find water, sugar, oil, wheat and milk at different positions. The rest of the ingredients are different and country-specific. For example, corn, cheese, starch, and garlic are frequently found in American products.France is characterized by high consumption of products containing pork, paprika and potatoes. Among the ingredients commonly consumed in Italy we find baking products, such as barley, malt, and yeast. In Switzerland and Germany, starch and spices are often present in products.

3.1.5 Obesity in 2016

A relevant point that we need to understand to fully respond to our investigation question is the obesity prevalence around the world. For this purpose we will use the information of obesity in 2016 (most recent) to visualize the prevalence of this condition in our 5 countries of interest. By clicking on each country we can display the information over the interactive plot . It is relevant to remember that the FoodRepo database is fixed on time. Therefore, using an in time dynamic evolution of obesity would not make sense in this context.

library(maps)
#Rearranging the data
c_names <- c("Switzerland","Italy","France","Germany","USA")

inter_obesity <- filter(OB_W,year=="2016")%>% filter(code== "FRA"|code=="DEU"|code=="ITA"|code=="CHE"|code=="USA")%>% arrange()
inter_obesity$obesity_rate <- as.numeric(inter_obesity$obesity_rate) 

obesity <- as.vector(inter_obesity$obesity_rate)

#Extracting world map
coord_countries <- map("world",c_names,exact=TRUE,fill=TRUE,plot=FALSE) 
coord_countries$value <- obesity 

#Creating interactive graph
pal <- colorNumeric("RdYlGn",domain=coord_countries$value,n=5,reverse=TRUE)

inter_graph <- leaflet(coord_countries) %>% addProviderTiles(providers$OpenStreetMap.Mapnik) %>%
  addPolygons(color =~pal(value),weight=1,popup= paste("Country: ",coord_countries$names,"<br>","Obesity Rate: ",coord_countries$value,"<br>"),highlightOptions = highlightOptions(color="black",weight=2,bringToFront=TRUE))%>%
  addLegend("bottomright",pal=pal, values=~value,title="Obesity Rate (2016)",opacity=1)%>%
  setView(lng=-39.443407,lat= 35.649315,zoom=2.4)

inter_graph

We clearly see here that the United States presents a higher rate of obesity than the rest of the western countries in the analysis. Italy, France, Switzerland and Germany range between 19.5% and 22.3% whereas the USA goes up to 36.2%. For the following analysis it could be interesting to address the differences between USA and the rest of the countries, as Americans got the highest levels of differentiation in terms of comparative obesity prevalence.

3.1.6 Comparative analysis

If we compare the obesity results in 3.1.4 and the confidence intervals in 3.1.3, we may see that the only interval in which USA showed a higher consumption rate is sugars. As discussed in section 2.2.3, the caloric energy would be expected to be relevant in obesity trends, as a higher caloric intake should be related to weight gain. The caloric intake was not highly correlated with sugars according to the correlation matrix we computed in the section 2.2.3. If we now take a look at the values that have the highest correlation with caloric intake (fat and carbohydrates), we will see that they don’t show a higher consumption in USA. Moreover, the countries that present the lowest obesity prevalence (Switzerland and Italy), present the highest consumption of fat.

We understand that high obesity prevalence is a multivariate problem, that does not only depend on the consumption of labeled products. Yet, by considering only the Food database results, the following hypothesis may be stated:

  • As carbohydrates and fats are correlated with a high caloric intake, we would expect them to be the main macro nutrients responsible for high obesity prevalence. The food consumption patterns presented in the analysis have shown that countries with diets that are richer in fat and carbohydrates don’t present a clear link with high obesity prevalence

  • As sugar is the only macro nutrient that presented a relevant positive difference in the USA regarding the rest of the countries, we may consider its presence as a relevant factor at the moment of predicting obesity prevalence.

  • Regarding the results obtained by USA, we may add that a low consumption of fiber and proteins may be an indicator of higher risk of obesity.

  • Finally, we need to make clear that from this analysis we may not have information about the portions and quantities that are consumed per person in each country. This could be a relevant factor at the moment of tracing obesity trends.

3.2 Do countries that consume more organic products present a consistent improvement of health indicators over time?

3.2.1. Analysis of organic vs. non-organic products distribution

To understand the distribution of consumption of organic products by country, we will create subsets for organic and non-organic products and compare the absolute and relative values with the total consumption. We define a product as organic if its name contains the words “organic” or “bio” in the English translation column.

organic <- filter(df1,(str_detect(df1$display_name_translations.en,'organic| bio')))
non_organic <- anti_join(df1,organic)
nrow(organic)
## [1] 270

We can see that only 270 out of 264,658 observations can be identified as organic . This value indicates a low consumption of organic products. Even though, we understand that 270 observations is a small number, we still would like analyze the ratio of organic and non-organic products by country.

# creating table with the amount of organic and non-organic products by country
#install.packages("ggpubr")
library(ggpubr)
df2 <- select(df1,country,display_name_translations.en )
df2$type = ifelse(str_detect(df1$display_name_translations.en,'organic | bio'), 'organic', 'non-organic')

type1 <- as.data.frame(
  df2 %>% 
  group_by(country) %>%
  count(type)) 

# adding column with the percentage of organic and non-organic products in total amount
type1 <-as.data.frame(
  type1 %>%
  group_by(country) %>%
  mutate(percent = 100*n/sum(n)))


#Visualization

#Total consumption of organic and non-organic food by country
#1)
organic_total_percent <- ggplot(type1)+
  geom_bar(aes(x=country, y=percent, fill=type), stat = 'identity')+
  scale_fill_manual("Type of product", values = c("non-organic" = "#A6CEE3", "organic" = "#1F78B4"))+
  theme(legend.position = "right")+
  labs(title = 'In percents',
       x = 'Country',
       y = 'Percentage of total consumption (%)',
       fill = 'Type of product')

#2)
organic_total_absolute <- ggplot(type1)+
  geom_bar(aes(x=country, y=n, fill=type), position="dodge", stat = 'identity')+
  scale_y_log10()+
  scale_fill_manual("Type of product", values = c("non-organic" = "#A6CEE3", "organic" = "#1F78B4"))+
  theme(legend.position = "right")+
  labs(title = 'Absolute values',
       x = 'Country',
       y = 'Number of products',
       fill = 'Type of product')

#1)+2)
figure1 <- ggarrange(organic_total_percent, organic_total_absolute,
                    ncol = 1, nrow = 2)
annotate_figure(figure1,
                top = text_grob("Total consumption of organic and non-organic food by country", face = "bold", size = 14))

By analyzing the results, we may see that the consumption of organic products is marginal when we compare it in terms of percentage and absolute values with the non-organic products.

As the number of entries among countries is not the same, it is difficult to compare which country has a higher consumption of organic. For making this comparison we will therefore analyze the distribution of organic consumption [%] per country.

#3)

# table with organic products only
type2 <-filter(type1, type == 'organic')

organic_only <- ggplot(type2)+
  geom_bar(aes(x=country, y=percent, fill=type), position="dodge", stat = 'identity')+
  scale_fill_manual("Type of product", values = c("organic" = "#1F78B4"))+
  theme(legend.position = "right")+
  labs(title = 'Consumption of organic food by country',
       x = 'Country',
       y = 'Percentage of total consumption (%)',
       fill = 'Type of product')
organic_only

Now we can clearly observe that the leaders in the consumption of organic products are Switzerland and France. For Switzerland this statement is more robust because the sample of Swiss products in our database is significantly larger than the French one. We also note that in the United States, the consumption of organic products is extremely low, even though this country counts with a large sample.

3.2.2. Comparison of macro nutrient composition in organic and non-organic products

With this new information, we want to compare, as we did with countries, the distribution of the proportion of macronutrients in organic and non-organic products.

non_organic$Organic <- "N"
organic$Organic <- "O"
df3 <- rbind(organic,non_organic)
Median_per_org <- df3 %>% group_by(Organic) %>% dplyr::summarise_if(is.numeric, median, na.rm=TRUE)

Macro_per_org <- Median_per_org[c(1,8:12)]
Macro_per_org[2:6]<-Macro_per_org[2:6]/rowSums(Macro_per_org[2:6])

Organic_graph <- plot_ly() %>% 
  add_trace(data=Macro_per_org,x=~fat,y=~Organic,type="bar",orientation="h",marker=list(color="#A6CEE3"),name="Fat") %>%
  add_trace(x=~carbohydrates,y=~Organic,type="bar",marker=list(color="#1F78B4"),name="Carbohydrates")%>%
  add_trace(x=~protein,y=~Organic,type="bar",marker=list(color="#B2DF8A"),name="Protein") %>%
  add_trace(x=~sugars,y=~Organic,type="bar",marker=list(color="#33A02C"),name="Sugars") %>%
  add_trace(x=~fiber,y=~Organic,type="bar",marker=list(color="#FB9A99"),name="Fiber") %>%
  layout(title = "Median macronutrients for Organic and Non-Organic products",xaxis = list(title="Proportion of Macronutrients"),barmode='stack')

Organic_graph

This visualization shows that non-organic products tend to have more fat and sugar but also less protein and fiber. These values will have to be tested to be able to say that differences between organic and non-organic are significant. In this case we wish to respond to the question: Is there a difference in the macro nutrients contained in organic and non-organic products?

3.2.3. Confidence Intervals

As we have already addressed, the amount of organic observations is almost marginal in comparison to non-organic ones. A correct approach for working with information that does not contain a big number of observations, is to check the confidence intervals. What we would expect to see are larger intervals for the columns with less information (organic), and smaller ones for the columns with more information (non-organic).

Through this analysis we compare the confidence intervals in each macro nutrient for organic versus non-organic. We would expect to see that the confidence intervals formed with a larger number of observations (non-organic) are always contained within the range of the ones formed by less data (organic). If so, we may conclude that there is no significant difference in the behavior of the macro nutrient. In the case in which the intervals do not match, we may suppose that for that specific macro nutrient, differences between the products may be found. It is relevant to highlight that this analysis is done regardless the country of origin of each product.

M <- lm(fat ~ 1, organic)
M_n<- lm(fat ~ 1, non_organic)
c0 <- coef(M)
c0_n <- coef(M_n)
cc <- confint(M, level = 0.95)
cc_n <-confint(M_n, level= 0.95)
x <- c(1,2)
xLabels=c("Organic", "Non-Organic")
F <- c(mean(organic$fat,na.rm=TRUE),mean(non_organic$fat, na.rm=TRUE))
L <-c(cc[1],cc_n[1])
U <-c(cc[2],cc_n[2])
require(plotrix)
par(mfrow=c(3,2))
plotCI(x, F, ui=U, li=L, xlab="Fat", ylab="Values [%]", xlim=c(0,3), title="Interval", xaxt="n", main="Confidence intervals and mean for Fat")
axis(side=1, at=(1:2), labels=xLabels)
M <- lm(carbohydrates ~ 1, organic)
M_n<- lm(carbohydrates ~ 1, non_organic)
c0 <- coef(M)
c0_n <- coef(M_n)
cc <- confint(M, level = 0.95)
cc_n <-confint(M_n, level= 0.95)
x <- c(1,2)
xLabels=c("Organic", "Non-Organic")
F <- c(mean(organic$carbohydrates,na.rm=TRUE),mean(non_organic$carbohydrates, na.rm=TRUE))
L <-c(cc[1],cc_n[1])
U <-c(cc[2],cc_n[2])
plotCI(x, F, ui=U, li=L, xlab="Carbohydrates", ylab="Values [%]", xlim=c(0,3), title="Interval", xaxt="n", main="Confidence intervals and mean for Carbohydrates")
axis(side=1, at=(1:2), labels=xLabels)
M <- lm(protein ~ 1, organic)
M_n<- lm(protein ~ 1, non_organic)
c0 <- coef(M)
c0_n <- coef(M_n)
cc <- confint(M, level = 0.95)
cc_n <-confint(M_n, level= 0.95)
x <- c(1,2)
xLabels=c("Organic", "Non-Organic")
F <- c(mean(organic$protein,na.rm=TRUE),mean(non_organic$protein, na.rm=TRUE))
L <-c(cc[1],cc_n[1])
U <-c(cc[2],cc_n[2])
plotCI(x, F, ui=U, li=L, xlab="Proteins", ylab="Values [%]", xlim=c(0,3), title="Interval", xaxt="n", main="Confidence intervals and mean for Proteins")
axis(side=1, at=(1:2), labels=xLabels)
M <- lm(sugars ~ 1, organic)
M_n<- lm(sugars ~ 1, non_organic)
c0 <- coef(M)
c0_n <- coef(M_n)
cc <- confint(M, level = 0.95)
cc_n <-confint(M_n, level= 0.95)
x <- c(1,2)
xLabels=c("Organic", "Non-Organic")
F <- c(mean(organic$sugars,na.rm=TRUE),mean(non_organic$sugars, na.rm=TRUE))
L <-c(cc[1],cc_n[1])
U <-c(cc[2],cc_n[2])
plotCI(x, F, ui=U, li=L, xlab="Sugars", ylab="Values [%]", xlim=c(0,3), title="Interval", xaxt="n", main="Confidence intervals and mean for Sugars")
axis(side=1, at=(1:2), labels=xLabels)
M <- lm(fiber ~ 1, organic)
M_n<- lm(fiber ~ 1, non_organic)
c0 <- coef(M)
c0_n <- coef(M_n)
cc <- confint(M, level = 0.95)
cc_n <-confint(M_n, level= 0.95)
x <- c(1,2)
xLabels=c("Organic", "Non-Organic")
F <- c(mean(organic$fiber,na.rm=TRUE),mean(non_organic$fiber, na.rm=TRUE))
L <-c(cc[1],cc_n[1])
U <-c(cc[2],cc_n[2])
plotCI(x, F, ui=U, li=L, xlab="Fiber", ylab="Values [%]", xlim=c(0,3), title="Interval", xaxt="n", main="Confidence intervals and mean for Fiber")
axis(side=1, at=(1:2), labels=xLabels)

As expected, we see that for every macro nutrient, the organic intervals present a wider range than the non-Organic ones. In the first three plots we may see that every non-organic interval may be found in the range of its organic counterpart. Yet, for the case of sugars and fibers a relevant difference may be spotted. In both cases the confidence interval of the non-organic, is outside the organic range. This is the way in which we had previously proposed to indicate the existence of differences between the values of macro nutrients. Hence, we may conclude that the values of sugars are higher in non-organic products, in comparison with organic ones. Moreover, the value of fiber is lower in non-organic in comparison with organic products.

3.2.5. Changes in health indicators for our 5 countries of interest

From the previous analysis, we see that organic products are indeed “healthier” in terms of sugar and fiber content. However, let us have a look at health indicators for countries that consume more and less organic products and try to answer the question: “Do the countries that consume more organic products present a consistent improvement of health indicators over time?”

Let us look at the percentage of adults who are obese as the main indicator of health in our study and consider life expectancy at birth in addition. Factors that affect a person’s life expectancy include gender, available healthcare and sanitary care, diet and food quality, level of physical activity, lifestyle, mental health, social environment and level of consumption, etc. However, most of these factors are health determinants, which means we can use life expectancy at birth as a proxy measure of health. We will take this indicator from the HDR dataset as it is a sub index of HDI. Since the available indicators are presented for several years, we will be able to visualize their evolution over time and evaluate the trend.

First of all, we will pay attention to the change in health indicators in the countries with the highest and lowest levels of consumption of organic products - Switzerland and the United States, respectively.

For this purpose, we create an interactive map with a selection of indexes with ShinyApp. You can see the map itself and try it out in in the independent shiny markdown, which also contains the steps for creating this map.

(Static demonstration of the map. For the interactive one, refer to Shiny.Rmd document)

We can observe the following evident patterns in terms of changes in health indicators with the help of this map:

  1. Obesity rate:

Over the past decades, the percentage of people who are obese has been rising steadily in all of the countries considered, which implies a less healthy lifestyle. We can observe, however, that relatively to other countries, Switzerland has the lowest rate of obesity in all these years. Overall, all of the European countries in our sample have a better performance on this indicator. While the U.S. has been the leader in the number of obese people in time.

  1. Life expectancy at birth:

The second proxy indicator of health, on the contrary, is gradually improving in all countries over time. But also, as with the obesity indicator, we see a similar behavior between countries. Namely, Switzerland consistently has the highest life expectancy compared to other countries, while the United States always has the lowest life expectancy among the available sample of countries.

3.3 Is it always true that highest economic indexes lead to better health conditions?

In this final analysis section we will further discuss the relationship between health and economic indicators. Can we expand our results to other relevant proxies, such as HDI index or Gross Income per capita? If not, can we find a global trend for better understanding the relationship between the different indexes?

3.3.1 Local Analysis for our 5 countries of interest

In order to address this question, we have used as a proxy for health, the obesity rate of each country. We decide to test this indicator against the Human Development Index (HDI) and the Gross Income per capita (GI) per country.

We first extract the proper information from the databases of Obesity and Human Development and then rearrange the columns . With this, we can visually represent the relationship, first with HDI and then with GI.

hdr_3 <- as.data.frame(hdr_data)

HD_I <- hdr_3[,c(1:3,6:37)]
HD_G <- hdr_3[,c(1:2,134:160)]

#Make the names comparable with the Obesity dataset

names(HD_I) <- gsub("hdi_", "", names(HD_I))
names(HD_G) <- gsub("gnipc_", "", names(HD_G))


HD_I <- melt(setDT(HD_I))
HD_G <- melt(setDT(HD_G))
colnames(HD_I)[4] <- 'year'
colnames(HD_I)[5] <- 'hdi'
colnames(HD_G)[3] <- 'year'
colnames(HD_G)[4] <- 'gi'

HD <- inner_join(HD_I,HD_G,c("country" = "country", "year" = "year"))
HD <- HD[,-6]
colnames(HD)[1]<- "code"

HD$year<-as.character(HD$year)
Obesity_hdi_World <- inner_join(OB_W, HD,c("code" = "code", "year" = "year"))

#Obesity_hdi_World$obesity_rate <- as.numeric(Obesity_hdi_World$obesity_rate)
Obesity_hdi_World$hdi <- as.numeric(Obesity_hdi_World$hdi)
Obesity_hdi_World <- Obesity_hdi_World [-6]
colnames(Obesity_hdi_World)[2]<- "country"
#Visualize relationship between HDI and Obesity Rate for our countries of interest
HD_OB_5 <-filter(Obesity_hdi_World,code== "FRA"|code=="DEU"|code=="ITA"|code=="CHE"|code=="USA")
HD_OB_5$obesity_rate <- as.numeric(HD_OB_5$obesity_rate)

rel <- ggplot(HD_OB_5,aes(x=year,y=obesity_rate,color=country,size=hdi))+
  scale_color_brewer(palette="Paired",name="Country")+
  geom_point(stat="identity")+
  labs(x="Year",y="Obesity Rate")+
  scale_size(range=c(0.5,4),name="Human Development Index")+
  scale_x_discrete(breaks = seq(1990,2016,5))
rel

#Visualize relationship between Gross Income per capita and Obesity Rate for our countries of interest

rel2 <- ggplot(HD_OB_5,aes(x=year,y=obesity_rate,color=country,size=gi))+
  scale_color_brewer(palette="Paired",name="Country")+
  geom_point(stat="identity")+
  labs(x="Year",y="Obesity Rate")+
  scale_size(range=c(0.5,4),name="Gross Income per Capita")+
  scale_x_discrete(breaks = seq(1990,2016,5))

rel2

From the plots, let us start by mentioning that an increase in the size of the points, represents a growth in time of the Human Development Index in plot (1), and a growth time of the Gross Income per Capita in plot (2). If we contrast the different sizes of the points with the overall tendency, which represents the evolution of obesity in time, we may address the existence of a relationship (bigger points are correlated with a higher point in the y axis). This would suggest that among the set of 5 analyzed countries, earning more money and climbing the HDI ladder would actually be linked to an increase in obesity rates. If we take a closer look, we can also observe that, whilst Switzerland has the highest HDI and GI and lowest obesity rate of all five. The US, on the contrary has the highest obesity rate and a lower HDI and GI compared to Switzerland. Regarding the rest of the countries, we affirm that they all move in a similar obesity rate, though the GI may fluctuate noticeably from country to country. As we understand that this analysis is not enough to state a relationship among indexes, we will investigate this relationship further.

3.3.2 Analysis using a broader range of countries

As we have previously commented, the sample of 5 countries that we have used for the analysis is not completely representative of the entire world conditions regarding health and economic indexes. For further analysis we have decided to use the rest of the developed countries (192) that were in our two databases. For this visualization, we used the mean of each indicator . This enables us to visualize the relationship between being overweight and living in a country with a certain level of life in a time-independent scenario. As we are interested mainly in the impact of these parameters in developed countries, we have decided to filter the countries having a HDI score of over 0.7 (based on United Nations classification).

OB_HD_DV <- filter(Obesity_hdi_World,hdicode=="Very High"|hdicode=="High")

Mean_HDI_OB <- OB_HD_DV %>% group_by(code) %>% summarize(HDI=mean(hdi),OB=mean(obesity_rate),GI=mean(gi))

ggplot(Mean_HDI_OB,aes(x=HDI,y=OB))+
  scale_color_brewer(palette="Paired")+
  geom_point(stat="identity")+
  labs(x="Human Development Index",y="Obesity Rate")

ggplot(Mean_HDI_OB,aes(x=GI,y=OB))+
  scale_color_brewer(palette="Paired")+
  geom_point(stat="identity")+
  labs(x="Gross Income",y="Obesity Rate")

By visual inspection we may not spot a clear linear relationship between the variables. Yet, if we force a linear relationship over this curve we could state the existence of a positive slope. In order to test this finding, we will continue using our developed countries database and test their linear relationship for explaining obesity through HDI and GI.

3.3.3. Testing relationship between HDI, GI and Obesity Rate

Let us start this analysis by calculating the correlation coefficients between the dependent variable (obesity) and the proposed explanatory variables (HDI and GI).

#Looking at the correlation coefficients and testing significance
OB_HD_DV$obesity_rate <- as.numeric(OB_HD_DV$obesity_rate)
OB_HD_DV$hdi <- as.numeric(OB_HD_DV$hdi)
OB_HD_DV$gi <- as.numeric(OB_HD_DV$gi)

coef_HD <- cor(OB_HD_DV$obesity_rate,OB_HD_DV$hdi,use="complete.obs")
coef_GI <- cor(OB_HD_DV$obesity_rate,OB_HD_DV$gi,use="complete.obs")

coef_HD <- signif(coef_HD,3)
coef_GI <- signif(coef_GI,3)

paste0("The correlation coefficient between HDI and Obesity rate is ",coef_HD," & the correlation coefficient between GI and Obesity rate is ",coef_GI)
## [1] "The correlation coefficient between HDI and Obesity rate is 0.275 & the correlation coefficient between GI and Obesity rate is 0.167"

The obtained coefficients are in accordance with our visual inspection. As expected, the correlation values are low, and they do not surpass a 30% in any of the cases. Yet, the two relationships are positive.

library(huxtable)
reg_HO <- lm(OB_HD_DV$obesity_rate~OB_HD_DV$hdi+OB_HD_DV$gi)

huxreg(reg_HO)
(1)
(Intercept)-0.039 *  
(0.016)   
OB_HD_DV$hdi0.299 ***
(0.023)   
OB_HD_DV$gi-0.000 ***
(0.000)   
N2761        
R20.080    
logLik3219.199    
AIC-6430.399    
*** p < 0.001; ** p < 0.01; * p < 0.05.

By analyzing the regression coefficients and their p-values we identify a high significance level for the relationship between obesity and HDI. On the other hand, Gross Income per Capita does not have an effect over the prevalence of obesity in developed countries, when a linear regression model is adjusted.

If we quantify the Human Development Index in terms of its coefficient, we find that an increase of 1 point in the HDI, results in an increase in the obesity rate of 0.3 points. This result can be explained through two possible hypothesis: (1) The high prevalence of obesity is a multidimensional problem that depends on several unobserved variables; or (2) in the context of a developed country, a stronger economy does not mean the improvement of certain health indicators. I any case, the obtained results show that an improvement in economic factors is linked to a deterioration of specific health indicators. Finally, we must highlight the fact that R2 = 0.08, this means that the model explains only an 8% of the total variability. We suppose other factors that have not been considered, should help explain the variability in obesity around the world.

To better understand the validity of our results we perform a final analysis that consists on grouping the information in clusters to test whether the variability of the model is improved when the behavior is explained by a multiple number of linear regressions instead of a single conjoint model. Let us create the necessary clusters by using the mean indexes for each country. With this information we compute a hierarchical clustering to categorize our countries.

#Creating distance matrix for Obesity and HDI
clust_data_HD <- na.omit(Mean_HDI_OB[,c(1:3)])
rnames_HD <- clust_data_HD$code
clust_data_HD <- clust_data_HD[,c(2,3)]
rownames(clust_data_HD) <- rnames_HD
distmat_HD <- dist(clust_data_HD,method="euclidean")

#Creating distance matrix for Obesity and GI
clust_data_GI <- na.omit(Mean_HDI_OB[,c(1,3,4)])
rnames_GI <- clust_data_GI$code
clust_data_GI <- clust_data_GI[,c(2,3)]
rownames(clust_data_GI) <- rnames_GI
distmat_GI <- dist(clust_data_GI,method="euclidean")
clust_HD <- hclust(distmat_HD,method="average")
clust_GI <- hclust(distmat_GI,method="average")
plot(clust_HD,hang=-1,cex=0.6)

plot(clust_GI,hang=-1,cex=0.6)

#This graph shows us that we should use k=2
library(cluster)
library(factoextra)
gap_stat_HD <- clusGap(clust_data_HD,FUNcluster=hcut,nstart= 5, K.max= 20, B= 50)
gap_stat_GI <- clusGap(clust_data_GI,FUNcluster=hcut,nstart= 5, K.max= 20, B= 50)

fviz_gap_stat(gap_stat_HD)

fviz_gap_stat(gap_stat_GI)

#Creating the groups 
groups_HD <- cutree(clust_HD,k=2)
table(groups_HD)
## groups_HD
##  1  2 
## 85  1
groups_GI <- cutree(clust_GI,k=2)
table(groups_GI)
## groups_GI
##  1  2 
## 96  8

By using the gap statistic to compute the number of ideal clusters, we obtain a disproportionate number of countries per cluster (85, 1 and 96, 8). This result does not bring any relevant information to our model, we therefore conclude that the clustering is not a useful tool in this case to improve the linear regression on our model. A possible option would be considering a non-linear model for this analysis.

To conclude, a brief overview of the literature on this subject brings some elements of response as to why an increase in HDI actually increases the obesity rates. The main finding is that developed and developing economies are linked to technology innovation which in turn has a negative impact on nutrition health (Bleich & al., 2008) . As stated in the Why is the developed world obese? article, these innovations have been associated with lowered costs of consumption and a sedentary lifestyle, throwing the energy intake balance off.

Discussion

5.1 Conclusion

Through this project we have further investigated the consumption of labeled products in 5 developed countries located in Europe and North America. From this analysis we have contrasted the main nutrition trends with health and economic indicators. From the obesity dataset we have found that he country with the highest prevalence of obesity is USA, with a 32%, and the one with the lowest is Switzerland, with a 19%. From the comparative analysis between the consumption of macro nutrients and the current situation of obesity among the analyzed countries the following conclusions are obtained: (1) The high consumption of sugars could be a relevant proxy at the moment of predicting high obesity rates; (2) Fats and carbohydrates are highly correlated with energy intake, but do not show a link with obesity; (3) A low consumption of proteins and fibers could also be relevant to consider at the moment of studying obesity; (4) There is no relevant difference in the relative consumption of ingredients among countries with high and low obesity levels. Finally, from this analysis we cannot obtain information about the portions and quantities that are consumed per person in each country. This could be a relevant factor when tracing obesity trends.

Regarding our findings in the distribution of organic and non organic, we may see that the consumption of organic products is marginal when we compare it in terms of percentage and absolute values with the non-organic products. Yet, Switzerland and France consume more organic products than the rest of the countries, while USA presents the lowest consumption. When evaluating the distribution of nutrients, we may see that the values of sugars are higher in non-organic products, in comparison with organic ones. Moreover, the value of fiber is lower in non-organic in comparison with organic products. Finally, if we take a look at the health indicators and contrast them with the consumption of organic, we may see that the country with the highest consumption of organic products consistently has the best health indicators. The same link occurs in reverse, namely that the country with the lowest consumption of organic products has relatively worse health indicators than other countries.Though this is an interesting result, we believe there might be other factors affecting both variables and therefore linking them. Namely, wealth could explain a higher range of organic products as well as affect the obesity rates, which we analyzed in the last part.

As a last point, we have presented an analysis to test the behavior in time of obesity against different economic indexes. The results show that there is a weak link between an increase in the obesity prevalence and an increase in the HDI. If we compare the situation of our best and worst nutrition indicator countries (Switzerland and USA) we see that whilst Switzerland has the highest HDI and GI and the lowest obesity rate of all five. USA, on the contrary has a high obesity rate and a lower HDI and GI compared to Switzerland. Finally, we included all of the highly developed countries to make a further comparison between the economic and obesity indicators. By making a linear regression we see that Gross Income per capita does not have an effect over obesity, whilst an increase of 1 point in HDI will increase the obesity rate in 0.3 points. Even though the coefficients in the linear model present a high p-value, it only explains an 8% of the general variability. We note that other aspects are very likely to explain more variability of our model, it may also be that the relationship is not best represented by a linear regression. After trying to incorporate a cluster analysis, no further results were addressed.

5.2 Limitations

As we have mentioned through the document, there are several limitations that we have found in the analysis.

  • The main limitation is the amount of countries, and the static nature of the Food Repo database. This limits the analysis in terms of heterogeneity and over time evolution.

  • The high obesity prevalence is a multivariate problem, that does not only depend on the consumption of labeled products.

  • No causal relationships are found in the analysis of nutrients. In every section we have shown tendencies and possible links, but not causality.

  • We may not generalize any finding to all of the countries that we did not consider in our research.

5.3 Propositions for future work

Through this project we have presented an in-depth analysis of the trends associated with food consumption in different countries of the world. We have been able to spot variability within macronutrients, that could explain differences in obesity prevalence. A proposed future work would be to try to understand the complete eating habits among different populations. As we know, the impact of food in our diets does not only depend on the products, it also depends on the schedules, portions and overall quantities.

In addition to this point, an interesting extra task could be to enlarge the database to understand (with a larger number of observations), the contribution of organic eating in health indicators among eastern and western countries.

Finally, literature indicates that there are other factors that could be linked to obesity. Our final regression model could be complemented by adding new explaining terms to our economic indexes, which could help in the task of explaining more of the variability. The Causes of Obesity (CDC,2022) article cites sleep and activity, among others, as important contributors to being overweight. An analysis of separately sleep and level of activity could be followed by a global and conjoint analysis of these three factors.

References